Medicinal Chrysanthemum Detection under Complex Environments Using the MC-LCNN Model

Medicinal chrysanthemum detection is one of the core tasks of selective chrysanthemum harvesting robots. However, it is challenging to achieve accurate detection in real time under complex unstructured field environments. In this context, we propose a novel lightweight convolutional neural network for medicinal chrysanthemum detection (MC-LCNN). First, in the backbone and neck components, we employed the proposed residual structures MC-ResNetv1 and MC-ResNetv2 as the main network and embedded the custom feature extraction module and feature fusion module to guide the gradient flow. Moreover, across the network, we used a custom loss function to improve the precision of the proposed model. The results showed that under the NVIDIA Tesla V100 GPU environment, the inference speed could reach 109.28 FPS at an input size of 416 × 416, and the detection precision (AP50) could reach 93.06%. In addition, we embedded the MC-LCNN model into the edge computing device NVIDIA Jetson TX2 for real-time object detection, adopting a CPU-GPU multithreaded pipeline design that improved the inference speed by 2 FPS. This model could be further developed into a perception system for selective chrysanthemum harvesting robots in the future.


Introduction
Numerous studies have reported that medicinal chrysanthemums have significant commercial value [1]. Furthermore, they have prominent medicinal value [2], including heat-clearing, eye-brightening, anti-inflammatory, antihypertensive, and antitumor properties. In the natural environment, a single chrysanthemum plant can present flower heads in different flowering stages, whereas medicinal chrysanthemums are mainly harvested at the bud stage. To illustrate the research objective of this study, the different flowering stages of medicinal chrysanthemums are presented in Figure 1.
At present, the harvesting process of medicinal chrysanthemums is labor-intensive and time-consuming. Consequently, due to the current shortage of skilled labor, it is highly desirable to develop a selective harvesting robot to solve the crop waste problem. The design of manipulators and the development of visual perception systems are vital for selective harvesting robots, and this study is focused on the development of visual perception systems for medicinal chrysanthemums. Traditional machine learning techniques for computer vision tasks are well developed, with shallow learning of image information through manual feature extraction [3]. Convolutional neural networks (CNNs), an important subset of machine learning techniques that learn hierarchical representations and discover potentially complex patterns from the data, have made impressive advances in the computer vision field [4]. CNNs have also yielded encouraging results in agriculture [5].
Although the approaches based on traditional machine learning techniques and deep learning techniques have achieved significant success in agricultural applications, developing lightweight networks for selective harvesting robots under unstructured environments is still difficult. We collected the literature on chrysanthemum detection based on traditional machine learning techniques and deep learning techniques throughout the world, and the results are shown in Table 1. Overall, the available literature is relatively scarce. When carefully analyzing Table 1, we found three issues that deserve further exploration.
Issue 1: The current research has not yet achieved high-accuracy, real-time detection of chrysanthemums.
Issue 2: Throughout the literature, the testing environment has mainly been in the laboratory, which cannot guarantee the robustness of the model.
Issue 3: Although there are some differences in the research tasks for chrysanthemum detection, the aim of the research is to achieve commercialization, and this could be effective in helping farmers reduce their workload. Commercial production inevitably requires embedding the models into low-power edge computing devices, but the current test results are laptop-based.

Table 1. Literature on chrysanthemum detection.

| Reference | Task | Year | Environment | Accuracy | Speed | Device |
|---|---|---|---|---|---|---|
| [6] | Chrysanthemum cut detection | 1996 | Ideal | / | / | Laptop |
| [7] | Chrysanthemum leaf recognition | 2000 | Ideal | / | / | Laptop |
| [8] | Chrysanthemum bud testing | 2014 | Ideal | 0.75 | / | Laptop |
| [9] | Chrysanthemum disease detection | 2017 | Ideal | / | / | Laptop |
| [10] | Chrysanthemum variety testing | 2018 | Illumination | 0.85 | 0.4 s | Laptop |
| [11] | Chrysanthemum picking | 2019 | Illumination | 0.9 | 0.7 s | Laptop |
| [12] | Chrysanthemum variety classification | 2019 | Ideal | 0.78 | 10 ms | Laptop |
| [1] | Chrysanthemum variety classification | 2020 | Ideal | 0.96 | / | Laptop |
| [13] | Chrysanthemum image recognition | 2020 | Ideal | 0.76 | 0.3 s | Laptop |

The rest of the paper is organized as follows. Section 2 describes the dataset, the hardware parameters of the NVIDIA Jetson TX2, the structure of the proposed model, the improvement approach of multithreading, the evaluation metrics, and the experimental setup. Section 3 presents the experimental results in detail. Section 4 discusses the experimental results, advantages and disadvantages, solutions, and future research perspectives of this work. Section 5 briefly summarizes the contributions of this study.

Dataset
The medicinal chrysanthemum dataset used in this study was collected at Yangma Town, China, from October 2019 to October 2021. Due to the short flowering stage of medicinal chrysanthemums, there are only a few days per year to collect suitable samples. The capture device was an Apple iPhone X with a video resolution of 1080 × 1920. The dataset was collected entirely in the field, with backgrounds including illumination variations, occlusions, and overlaps. To ensure the robustness of the robotic perception system, the images were collected without imposing any constraints on the natural environment. The dataset, comprising a total of 4000 chrysanthemum images, was divided into training, validation, and test datasets following a ratio of 6:3:1. Some original images are shown in Figure 2.
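The 6:3:1 division can be made concrete with a short calculation. The following is a minimal sketch, not the authors' tooling; the helper name `split_counts` and the rule of assigning any rounding remainder to the first split are assumptions made for illustration.

```python
def split_counts(total, ratios):
    """Split `total` items into integer counts proportional to `ratios`."""
    weight = sum(ratios)
    counts = [total * r // weight for r in ratios]
    # Assumption: any remainder left by integer division goes to the first split.
    counts[0] += total - sum(counts)
    return counts


train, val, test = split_counts(4000, (6, 3, 1))
print(train, val, test)  # 2400 1200 400
```

For 4000 images this gives 2400 training, 1200 validation, and 400 test images.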

NVIDIA Jetson TX2
The NVIDIA Jetson TX2 comprises a 6-core ARMv8 64-bit CPU complex and a 256-core NVIDIA Pascal architecture GPU. The CPU complex includes a dual-core Denver2 processor and a quad-core ARM Cortex-A57, and the module provides 8 GB of LPDDR4 memory with a 128-bit interface, making it well suited to applications that require low power and high computing performance. Therefore, we chose this edge computing device to design and implement a real-time object detection system. The NVIDIA Jetson TX2 is shown in Figure 3.

MC-LCNN
The MC-LCNN is a lightweight network (11.3 M) that can achieve real-time detection in complex unstructured environments (light changes, occlusions, and overlaps). The network consists of a backbone, a neck, and a head, as shown in Figure 4. In the backbone, the main network uses the proposed MC-ResNetv1 together with the CBM module and the SPP module. In the neck, the main network uses the proposed MC-ResNetv2 with the CBL module embedded. In the head, a feature pyramid network (FPN) feature fusion strategy is employed. Furthermore, several strategies were used throughout the network to improve the training robustness, including exponential moving average (EMA), a larger batch size, DropBlock regularization, and generalized focal loss.

MC-ResNetv1 and MC-ResNetv2
The main challenge in implementing lightweight models is that under a fixed computational budget (FLOPs), only a limited number of feature channels can be afforded. To increase the number of channels at a low computational budget, we employed a 1 × 1 convolution and a bottleneck structure to achieve information exchange between different channels. The shape of the 1 × 1 convolution is determined by the number of input channels $c_1$ and output channels $c_2$. Thus, the FLOPs of the 1 × 1 convolution can be calculated as $B = hwc_1c_2$, where $h$ and $w$ are the spatial sizes of the feature maps. When the cache in the computing device is sufficiently large to store all the feature maps and parameters, the memory access cost is $\mathrm{MAC} = hw(c_1 + c_2) + c_1c_2$. Based on the mean inequality, we obtain the following:

$$\mathrm{MAC} \geq 2\sqrt{hwB} + \frac{B}{hw} \tag{1}$$

Accordingly, the memory access cost has a lower bound determined by the FLOPs, and it reaches this bound when the numbers of input and output channels are equal.
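The bound in Equation (1) can be checked numerically. The sketch below is only an illustration: the feature map size (56 × 56) and the channel pairs are hypothetical values chosen so that all three layers have the same FLOPs, and the function names are not from the paper.

```python
import math


def flops_1x1(h, w, c1, c2):
    """FLOPs of a 1x1 convolution: B = h * w * c1 * c2."""
    return h * w * c1 * c2


def mac_1x1(h, w, c1, c2):
    """Memory access cost: input and output feature maps plus the weights."""
    return h * w * (c1 + c2) + c1 * c2


def mac_lower_bound(h, w, flops):
    """Right-hand side of Equation (1): 2 * sqrt(h*w*B) + B / (h*w)."""
    return 2 * math.sqrt(h * w * flops) + flops / (h * w)


h = w = 56  # hypothetical feature map size
for c1, c2 in [(128, 128), (64, 256), (32, 512)]:  # same product c1 * c2
    B = flops_1x1(h, w, c1, c2)
    print(c1, c2, B, mac_1x1(h, w, c1, c2), mac_lower_bound(h, w, B))
```

With equal FLOPs, only the balanced pair (128, 128) attains the bound; the more unbalanced the channel widths, the larger the memory access cost.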
A 1 × 1 group convolution reduces the computational burden by replacing dense convolution with sparse convolution. On the one hand, it allows more channels to be used at fixed FLOPs and increases the network capacity. On the other hand, the increase in the number of channels leads to a higher memory access cost. The relationship between memory access cost and FLOPs for a 1 × 1 group convolution is as follows:

$$\mathrm{MAC} = hw(c_1 + c_2) + \frac{c_1c_2}{g} = hwc_1 + \frac{Bg}{c_1} + \frac{B}{hw} \tag{2}$$

where $g$ denotes the number of groups, and $B = hwc_1c_2/g$ stands for FLOPs. Given the fixed input shape $c_1 \times h \times w$ and the computational cost $B$, the memory access cost increases with the growth of $g$. Both 1 × 1 convolution and bottleneck structures increase the memory access cost. This cost is not negligible, especially for lightweight networks. Consequently, to obtain high model capacity and efficiency, the critical issue is how to keep numerous equal-width channels without either dense convolution or many groups.

To achieve the above, we designed the MC-ResNetv1 module. We introduced a simple operator named Focus, where the input is split into two branches at the beginning of each unit. One branch uses a shortcut design, where half of the feature channels pass directly through the block and join the next block, which can be considered as feature reuse. The other branch comprises two convolutions with the same input and output channels. Moreover, another module, MC-ResNetv2, was designed, where the Focus operation was removed and the number of output channels was thereby doubled. At the same time, the original shortcut design was replaced with two convolutions. The blocks were repeatedly stacked to construct the entire network. The 3 × 3 convolutions are followed by an additional 1 × 1 convolutional layer to blend the features, and the number of channels in each block is scaled to generate networks of different complexities.
In addition, the 1 × 1 convolution removes computational bottlenecks by reducing the dimensionality of the module, which would otherwise constrain the size of the network. This allows both the depth and the width of the network to be increased without significantly affecting performance. To verify the performance of MC-ResNetv1 and MC-ResNetv2, we implemented ablation experiments, as outlined in Section 3.2.
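The growth of the memory access cost with the number of groups, as stated for the 1 × 1 group convolution in this section, can be illustrated as follows. This is a sketch under assumed values: the input shape (256 × 28 × 28) and the FLOPs budget are hypothetical, and `mac_group_1x1` is an illustrative helper rather than code from the study.

```python
def mac_group_1x1(h, w, c1, flops, g):
    """MAC of a 1x1 group convolution at fixed FLOPs B:
    h*w*c1 + B*g/c1 + B/(h*w)."""
    return h * w * c1 + flops * g / c1 + flops / (h * w)


h = w = 28          # hypothetical feature map size
c1 = 256            # hypothetical number of input channels
B = h * w * c1 * 256  # budget of the dense layer (g = 1) with 256 output channels

for g in (1, 2, 4, 8):
    c2 = B * g // (h * w * c1)  # output channels affordable at this budget
    print(g, c2, mac_group_1x1(h, w, c1, B, g))
```

At a fixed budget, doubling the number of groups doubles the affordable output channels, but the term $Bg/c_1$ grows in proportion, so the memory access cost rises with every increase in $g$.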

Generalized Focal Loss
Focal loss [14] is designed for object detection tasks with an imbalance between the foreground and background classes, as given in Equation (3):

$$\mathrm{FL}(p) = -(1 - p_t)^{\gamma} \log(p_t), \quad p_t = \begin{cases} p, & y = 1 \\ 1 - p, & y = 0 \end{cases} \tag{3}$$

where $y \in \{1, 0\}$ denotes the ground truth class, $p \in [0, 1]$ indicates the estimated probability of the class labeled as $y = 1$, and $\gamma$ represents an adjustable focusing parameter.
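As a minimal sketch of Equation (3), the function below evaluates focal loss for a single prediction. The default `gamma=2.0` is an assumption for illustration (the value used in this study is not stated here), and `p` must lie strictly between 0 and 1 for the logarithm to be defined.

```python
import math


def focal_loss(p, y, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# A well-classified foreground sample is down-weighted relative to
# plain cross-entropy (gamma = 0), a hard one much less so.
print(focal_loss(0.9, 1, gamma=0.0), focal_loss(0.9, 1, gamma=2.0))
print(focal_loss(0.1, 1, gamma=0.0), focal_loss(0.1, 1, gamma=2.0))
```

Setting `gamma` to 0 recovers the standard cross-entropy, which makes the role of the scaling factor easy to isolate.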
To be specific, focal loss comprises a dynamic scaling factor $(1 - p_t)^{\gamma}$ and a standard cross-entropy part $-\log(p_t)$. Due to the presence of class imbalance problems, we considered extending the two components of focal loss, giving the quality focus loss (Q):

$$Q(\sigma) = -|y - \sigma|^{\beta}\left((1 - y)\log(1 - \sigma) + y\log(\sigma)\right) \tag{4}$$

where $\sigma = y$ is the global minimum solution of the quality focus loss. $|y - \sigma|^{\beta}$ is a moderating factor that goes to 0 when the quality estimate becomes accurate, i.e., $\sigma \to y$, so that the loss of well-estimated samples is down-weighted; the parameter $\beta$ smoothly controls the down-weighting rate.

We used the relative offset from the location to the four sides of the bounding box as the regression objective. Conventional bounding box regression models the regression label $y$ as a Dirac delta distribution $\delta(x - y)$, which satisfies $\int_{-\infty}^{+\infty} \delta(x - y)\,dx = 1$. The integral form of $y$ is as follows:

$$y = \int_{-\infty}^{+\infty} \delta(x - y)\,x\,dx \tag{5}$$

Instead of the Dirac delta assumption, we learnt the underlying general distribution $P(x)$ directly without introducing any other prior. Given the range of labels with minimum $y_0$ and maximum $y_n$ ($y_0 \leq y \leq y_n$, $n \in \mathbb{N}^{+}$), we can estimate $\hat{y}$ from the model:

$$\hat{y} = \int_{-\infty}^{+\infty} P(x)\,x\,dx = \int_{y_0}^{y_n} P(x)\,x\,dx \tag{6}$$

To be consistent with the network structure, we discretized the range $[y_0, y_n]$ as a set $\{y_0, y_1, \ldots, y_i, y_{i+1}, \ldots, y_{n-1}, y_n\}$, converting the integral over the continuous domain into a discrete representation. Thus, according to the discrete distribution property $\sum_{i=0}^{n} P(y_i) = 1$, the regression value $\hat{y}$ can be formulated as follows:

$$\hat{y} = \sum_{i=0}^{n} P(y_i)\,y_i \tag{7}$$

Consequently, $P(x)$ can be simply implemented by a softmax layer $S(\cdot)$, where $P(y_i)$ is denoted as $S_i$.
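The quality focus loss of Equation (4) and the discrete expectation of Equation (7) can be sketched in a few lines. This is an illustration under assumptions: `beta=2.0`, the eight integer bins, and the function names are hypothetical choices, not settings reported in this study.

```python
import math


def quality_focus_loss(sigma, y, beta=2.0):
    """Equation (4): cross-entropy on a soft label y, scaled by |y - sigma|**beta."""
    ce = (1.0 - y) * math.log(1.0 - sigma) + y * math.log(sigma)
    return -(abs(y - sigma) ** beta) * ce


def softmax(logits):
    """Numerically stable softmax S(.) over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits, bins):
    """Equation (7): y_hat = sum_i S_i * y_i over the discretized range."""
    return sum(s * y for s, y in zip(softmax(logits), bins))


bins = list(range(8))  # hypothetical discretization y_0..y_7
print(quality_focus_loss(0.7, 0.7))       # zero at the minimum sigma = y
print(expected_offset([0.0] * 8, bins))   # uniform distribution -> midpoint
```

A peaked set of logits pulls the expectation toward the corresponding bin, which is what allows a softmax output with n + 1 values to represent a continuous offset.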
To encourage high probability values close to the target $y$ when optimizing $P(x)$, we introduced a distribution focus loss (D). By enlarging the probabilities of $y_i$ and $y_{i+1}$, the two values nearest to $y$ ($y_i \leq y \leq y_{i+1}$), the network is forced to focus quickly on values close to the label $y$. We defined distribution focus loss by applying the whole cross-entropy part of quality focus loss:

$$D(S_i, S_{i+1}) = -\left((y_{i+1} - y)\log(S_i) + (y - y_i)\log(S_{i+1})\right) \tag{8}$$

The purpose of distribution focus loss is to enlarge the probability of the values around the target $y$. The global minimum solution of distribution focus loss, i.e., $S_i = \frac{y_{i+1} - y}{y_{i+1} - y_i}$ and $S_{i+1} = \frac{y - y_i}{y_{i+1} - y_i}$, ensures that the estimated regression target $\hat{y}$ is arbitrarily close to the corresponding label $y$, i.e.,

$$\hat{y} = \sum_{j=0}^{n} P(y_j)\,y_j = S_i y_i + S_{i+1} y_{i+1} = \frac{y_{i+1} - y}{y_{i+1} - y_i}\,y_i + \frac{y - y_i}{y_{i+1} - y_i}\,y_{i+1} = y$$

Quality focus loss and distribution focus loss can be unified into a general form known as generalized focal loss. Suppose a model has probability estimates for two variables $y_l$, $y_r$ ($y_l < y_r$) as $p_{y_l}$, $p_{y_r}$ ($p_{y_l} \geq 0$, $p_{y_r} \geq 0$, $p_{y_l} + p_{y_r} = 1$), and the final prediction is their linear combination $\hat{y} = y_l p_{y_l} + y_r p_{y_r}$ ($y_l \leq \hat{y} \leq y_r$). The corresponding label $y$ of the prediction $\hat{y}$ also satisfies $y_l \leq y \leq y_r$. With the absolute distance $|y - \hat{y}|^{\beta}$ ($\beta \geq 0$) as the moderating factor, the equation of generalized focal loss (G) is as follows:

$$G(p_{y_l}, p_{y_r}) = -\left|y - (y_l p_{y_l} + y_r p_{y_r})\right|^{\beta}\left((y_r - y)\log(p_{y_l}) + (y - y_l)\log(p_{y_r})\right) \tag{9}$$

$G(p_{y_l}, p_{y_r})$ reaches a global minimum at $p^{*}_{y_l} = \frac{y_r - y}{y_r - y_l}$ and $p^{*}_{y_r} = \frac{y - y_l}{y_r - y_l}$, which also implies that $\hat{y}$ exactly matches the continuous label $y$, i.e., $\hat{y} = y_l p^{*}_{y_l} + y_r p^{*}_{y_r} = y$.

The modified detector differs from the former detector in two respects. First, during inference we fed the classification scores directly to NMS as scores, without multiplying them by any separate quality prediction, if one existed.
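Second, the final layer of the regression branch used to predict the location of each bounding box now has $n + 1$ outputs rather than 1 output, resulting in negligible additional computational cost. We can define the training loss $L$ in terms of generalized focal loss as follows:

$$L = \frac{1}{N_{pos}} \sum_{z} L_Q + \frac{1}{N_{pos}} \sum_{z} \mathbb{1}_{\{c_z^{*} > 0\}} \left(\lambda_0 L_B + \lambda_1 L_D\right) \tag{10}$$

where $L_Q$ is quality focus loss, and $L_D$ is distribution focus loss. $L_B$ stands for GIoU loss, and $\lambda_0$ and $\lambda_1$ refer to the balance weights of $L_B$ and $L_D$, respectively. $N_{pos}$ is the number of positive samples, and the summation runs over all locations $z$ on the feature maps. Here, $\mathbb{1}_{\{c_z^{*} > 0\}}$ is the indicator function, where the value is 1 if $c_z^{*} > 0$ and 0 otherwise.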
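The stated minimizers of distribution focus loss and generalized focal loss can be verified numerically. The following sketch uses a hypothetical label y = 2.3 between the neighboring bins 2.0 and 3.0 and an assumed `beta=2.0`; the function names are illustrative only.

```python
import math


def distribution_focus_loss(s_i, s_next, y, y_i, y_next):
    """Cross-entropy over the two bins that bracket the label y."""
    return -((y_next - y) * math.log(s_i) + (y - y_i) * math.log(s_next))


def generalized_focal_loss(p_l, p_r, y, y_l, y_r, beta=2.0):
    """Same cross-entropy, scaled by |y - y_hat|**beta with y_hat = y_l*p_l + y_r*p_r."""
    y_hat = y_l * p_l + y_r * p_r
    ce = (y_r - y) * math.log(p_l) + (y - y_l) * math.log(p_r)
    return -(abs(y - y_hat) ** beta) * ce


def minimizer(y, y_l, y_r):
    """Closed-form minimum: probabilities proportional to the opposite distances."""
    width = y_r - y_l
    return (y_r - y) / width, (y - y_l) / width


y, y_l, y_r = 2.3, 2.0, 3.0  # hypothetical label and neighboring bins
p_l, p_r = minimizer(y, y_l, y_r)
print(p_l, p_r, y_l * p_l + y_r * p_r)
print(distribution_focus_loss(p_l, p_r, y, y_l, y_r))
print(generalized_focal_loss(p_l, p_r, y, y_l, y_r))
```

At the minimizer the weighted sum of the two bins reproduces the label, and any other split of the probability mass between the two bins gives a larger distribution focus loss.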

CPU-GPU Multithreaded Pipeline Design
To make full use of GPU computational power, the aim was to design a real-time object detection system on the NVIDIA Jetson TX2, a low-power embedded heterogeneous GPU platform. Due to the low power of the TX2, energy consumption can be controlled by minimizing the calculations during system operation, and the inference speed can be improved simultaneously. Computational reduction often leads to a decline in detection accuracy, and the critical issue to be tackled is how to increase the inference speed while retaining system accuracy.
With a multicore CPU on the TX2, we maximized the use of the GPU's computational power via a multithreaded CPU-GPU pipeline design, where the CPU is primarily responsible for logic-heavy tasks and the GPU handles high-density floating-point calculations. The data are transferred from the CPU memory to the GPU graphics memory; the GPU processes the data and then transfers the results back to the CPU memory. The detection time of the system is measured from reading the image to the point at which the system completes the detection and returns the detected object and its position. By timing each part of the code, we found that the object detection time was spent primarily in the CPU image preprocessing and GPU network prediction stages, whereas the time for the final CPU output of the detection results was negligible. Further statistical analysis of the TC-YOLO network execution on the CPU and GPU showed that processing each frame took approximately 21 ms in single-threaded operation, with 12.6 ms executed on the GPU and 8.4 ms on the CPU. Because the CPU on the TX2 development board is multicore, we opened multiple threads for scheduling to keep the GPU in constant computation. Here, one thread performs the GPU task while another thread simultaneously conducts the CPU image reading and preprocessing tasks. When the first thread finishes the GPU computation, the second thread can immediately start the GPU computation task. The GPU computation for the previous image and the CPU preprocessing of the next image thus run at the same time, so the preprocessing time for each image can be saved during detection.
Depending on the dataset and input requirements, the number of threads can be adjusted; we used two threads for pipelined detection in the current application. In the ideal case, the pipeline entirely hides the CPU processing time, so only the GPU processing time counts toward the detection time. In addition, the improvements proposed herein do not involve changes to the network structure and thus have no impact on the accuracy of the system.
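The two-thread pipeline described above can be sketched with a bounded queue. This is a simplified illustration, not the implementation deployed on the TX2: `preprocess` and `infer` are hypothetical stand-ins for the CPU preprocessing and GPU prediction stages, and here the main thread plays the role of the GPU thread.

```python
import queue
import threading


def preprocess(frame):
    """Hypothetical stand-in for CPU image reading and preprocessing."""
    return frame * 2


def infer(tensor):
    """Hypothetical stand-in for the GPU network prediction."""
    return tensor + 1


def run_pipeline(frames, depth=2):
    """Overlap preprocessing of frame k+1 with inference on frame k."""
    ready = queue.Queue(maxsize=depth)  # bounded so the CPU cannot run far ahead
    results = []

    def producer():
        for frame in frames:
            ready.put(preprocess(frame))
        ready.put(None)  # sentinel: no more frames

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        item = ready.get()
        if item is None:
            break
        results.append(infer(item))
    worker.join()
    return results


def ideal_frame_time_ms(cpu_ms, gpu_ms, pipelined):
    """Per-frame time: sum of stages if serial, slowest stage if fully overlapped."""
    return max(cpu_ms, gpu_ms) if pipelined else cpu_ms + gpu_ms


print(run_pipeline([1, 2, 3]))  # [3, 5, 7]
print(ideal_frame_time_ms(8.4, 12.6, pipelined=False))
print(ideal_frame_time_ms(8.4, 12.6, pipelined=True))
```

With the stage times reported above (8.4 ms on the CPU and 12.6 ms on the GPU), the ideal pipelined time per frame is bounded by the GPU stage. Thread synchronization and queue hand-off add overhead that this idealized figure ignores.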

Evaluation Metrics
To describe the detection results in more detail, we introduced a series of evaluation metrics based on average precision (AP), including AP50, AP75, APS, APM, and APL, where AP50 denotes the AP at intersection over union = 0.5, and AP75 indicates the AP at intersection over union = 0.75. APS indicates the AP with detection area less than 1394 (34 × 41), APM indicates the AP with detection area larger than 1394 (34 × 41) and smaller than 2888 (76 × 38), and APL refers to the AP with detection area larger than 2888 (76 × 38). The equation of AP is as follows:

$$\mathrm{AP} = \sum_{i=1}^{N} P(i)\,\Delta\mathrm{recall}(i) \tag{11}$$

where $N$ denotes the number of test images, $P(i)$ represents the precision value at $i$ images, and $\Delta\mathrm{recall}(i)$ is the change in recall between $i - 1$ and $i$ images.
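A minimal sketch of Equation (11) and of the area thresholds is given below. The precision and recall values are made up for illustration, and the handling of areas exactly equal to a threshold is an assumption, since the text only specifies strict inequalities.

```python
def average_precision(precisions, recalls):
    """AP = sum_i P(i) * (recall(i) - recall(i - 1)), with recall(0) = 0."""
    ap = 0.0
    prev_recall = 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap


def size_bucket(width, height):
    """Assign a detection to the S, M, or L group by its area in pixels."""
    area = width * height
    if area < 1394:   # 34 x 41
        return "S"
    if area < 2888:   # 76 x 38; assumption: boundary areas go to the larger group
        return "M"
    return "L"


# Hypothetical running precision and recall after each of three images.
print(average_precision([1.0, 0.5, 2 / 3], [0.5, 0.5, 1.0]))
print(size_bucket(30, 30), size_bucket(50, 50), size_bucket(76, 38))
```

Only steps at which recall increases contribute to the sum, so the second entry in the example adds nothing to the AP.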

Experimental Setup
The experiments were conducted on a server with NVIDIA Tesla V100, CUDA 11.2. The basic detection frameworks were MC-ResNetv1 and MC-ResNetv2. During training, the key hyperparameters were set as follows: learning rate = 0.0002, momentum = 0.8, gamma = 0.1, and weight decay = 0.0002. The optimizer used was stochastic gradient descent (SGD). Moreover, to ensure the test results would be more convincing, we executed the whole test process 10 times, and the final test results were averaged.
In different versions of the same model, as the input size of the image gets larger, the network needs more layers (deeper and wider) to expand the receptive fields and more channels to capture finer-grained features. Thus, the network depth or width of the backbone typically differs between versions of the same model; in other words, their weight files differ. Simply resizing the inputs of different versions to the same resolution would be unfair to these state-of-the-art models. To this end, we kept all parameters of the comparison models, including input size, backbone, and weights, unchanged, allowing all models to perform well. It is worth noting that the input size of the proposed model was 416 × 416 (we resized the 1080 × 1920 images to 416 × 416) to balance performance and inference speed.

The Impact of Data Augmentation on the MC-LCNN
Data augmentation is an integral part of the whole training process and has a direct impact on the final detection accuracy. We compared 14 influential data augmentation methods [15][16][17][18] and combined them to determine the final approach for dataset augmentation in this study. First, we tested the 14 augmentation methods in turn and then selected the top four performing methods to combine and test them. An augmentation method that performs poorly on its own is unlikely to become superior when combined with other methods, so we only considered the four best-performing augmentation methods. The test results are shown in Table 2.
We can clearly observe that Cutout, Blur, Flip, and Rotation achieved excellent performance with AP50 of 91.14%, 90.69%, 88.59%, and 88.38%, respectively. Surprisingly, the three most advanced data augmentation methods, namely Mixup, Cutmix, and Mosaic, all showed mediocre performance, probably because the image features of medicinal chrysanthemums, such as color and texture, are mostly similar. Thus, using complex augmentation methods can generate a large amount of redundant local information and cause overfitting. It is worth noting that Blur ranked second among all the augmentation methods, probably because Blur adds new features to the dataset rather than redundant ones, which greatly improves the robustness of the model. Furthermore, when we combined Cutout and Blur, the AP50 improved from 91.14% to 93.06%, an encouraging result. In summary, we combined Cutout and Blur as the data augmentation methods in this study.
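To make the selected combination concrete, the sketch below applies Cutout followed by Blur to a small grayscale image stored as nested lists. It is only an illustration: the patch size, the 3 × 3 box blur, and the fill value are assumptions, as the exact augmentation parameters are not given in this section.

```python
import random


def cutout(image, size, rng, fill=0.0):
    """Return a copy of `image` with one random size x size square set to `fill`."""
    h, w = len(image), len(image[0])
    top = rng.randrange(0, h - size + 1)
    left = rng.randrange(0, w - size + 1)
    out = [row[:] for row in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = fill
    return out


def box_blur(image):
    """3 x 3 mean filter; border pixels average over the neighbors that exist."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [image[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = sum(window) / len(window)
    return out


rng = random.Random(0)                       # fixed seed for reproducibility
image = [[1.0] * 8 for _ in range(8)]        # hypothetical 8 x 8 grayscale image
augmented = box_blur(cutout(image, size=3, rng=rng))
for row in augmented:
    print(" ".join(f"{v:.2f}" for v in row))
```

Cutout removes a local region so the detector cannot rely on any single part of a flower head, while the blur softens the edges of the removed patch and the remaining texture.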

Ablation Experiments
MC-LCNN employs several modules, including the proposed MC-ResNet, DropBlock, EMA, SPP, and CBM. We used ablation experiments to verify the performance of these modules. First, to validate the performance of the MC-ResNet module, we replaced MC-ResNet with each of 24 alternative feature extraction networks. Furthermore, to validate the performance of DropBlock, EMA, and SPP, we removed these modules. Finally, we verified the performance of CBM by sequentially increasing the number of CBM modules. It is worth noting that MC-LCNN is essentially a convolutional neural network, so CBM cannot be completely removed. The results of the ablation experiments are shown in Table 3.
First, as observed in Table 3, MC-ResNet outperformed the 24 feature extraction networks, and the AP50, APS, APM, and APL were 2.13%, 3.29%, 3.14%, and 2.36% higher than those of the second-best CSPResNeXt module, respectively, showing that MC-ResNet had the most prominent ability for small object feature extraction. In addition, the inference speed (FPS) of MC-ResNet was 11.07% higher than that of CSPDarknet53 (the module with the second highest inference speed after MC-ResNet). Second, after adding DropBlock, EMA, and SPP, the AP50 of MC-LCNN improved by 3.4%, 2.12%, and 6.81% and the inference speed (FPS) improved by 2.4%, 2.59%, and 7.99%, respectively. Because SPP can receive a feature map input of any size and output a feature vector of fixed size, it can significantly improve the detection precision and inference speed of the model. Finally, we verified that the best overall performance of MC-LCNN could be achieved using a single CBM module. When several CBM modules were employed, the AP50 of the whole model showed only a slight increase. When using four CBM modules, the AP50 marginally increased by 0.27%, but the inference speed decreased significantly, by 19.45 FPS. To intuitively observe the image features, we show the visualization process of some images in MC-LCNN in Figure 5.

Comparisons with State-of-the-Art Detection Methods
In this section, we present a comprehensive comparison of the latest 13 object detection frameworks (54 models) with the proposed MC-LCNN. The results are shown in Table 4.
First, our goal was to build a lightweight network, so the inference speed of the model was crucial. MC-LCNN (109.28 FPS) ranked second in inference speed among the 54 models, behind only PP-YOLOv2 with an input size of 320 × 320 (110.54 FPS). However, the AP50 of MC-LCNN (93.06%) was 7.08 percentage points higher than that of this PP-YOLOv2 variant (85.98%), a clear advantage. Second, although MC-LCNN was not the fastest model, its detection accuracy (AP50 = 93.06%) was the highest among the 54 models and 3.43 percentage points higher than that of the second-best model, YOLOX-X (AP50 = 89.63%), which is an encouraging result. In addition, the detection precision of MC-LCNN for different anchor box sizes (APS = 69.63%, APM = 76.42%, and APL = 88.89%) was 4.41, 2.88, and 2.03 percentage points higher, respectively, than that of YOLOX-X (APS = 65.22%, APM = 73.54%, and APL = 86.86%). The advantage of MC-LCNN was largest for small anchor boxes, which is critical for robotic systems that operate in natural environments: because of path planning constraints, small-sized anchor box detection is particularly relevant when the robot picks distant chrysanthemums.
Finally, following the improvement strategy in Section 2.4, we tested MC-LCNN on a heterogeneous GPU platform, the NVIDIA Jetson TX2; an example is shown in Figure 6. The precision of the model remained unchanged, and the inference speed of the whole model increased by 2 FPS owing to the multithreaded CPU-GPU pipeline design. In the ideal case, we assumed that the design would completely hide the CPU processing time and therefore counted only the GPU processing time, which gave an improvement of approximately 19 FPS. However, because of the FPS calculation and communication losses between threads, the actual improvement fell short of this ideal case, although the design still saved part of the CPU preprocessing time. The test results on the NVIDIA Jetson TX2 are shown in Figure 7.
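The idea behind a multithreaded CPU-GPU pipeline can be illustrated with a minimal producer-consumer sketch. This is not the implementation used in this study: the functions `preprocess`, `infer`, `run_pipeline`, and `ideal_fps` are hypothetical names, the stage bodies are placeholders, and the stage timings passed to `ideal_fps` are made-up values chosen only to show why the ideal gain assumes that CPU time is fully hidden.

```python
import queue
import threading


def preprocess(frame_id):
    # Placeholder for CPU-side work (decode, resize, normalize); hypothetical.
    return {"id": frame_id, "tensor": [frame_id] * 4}


def infer(item):
    # Placeholder for GPU-side inference; returns a fake detection count.
    return {"id": item["id"], "detections": sum(item["tensor"]) % 3}


def run_pipeline(num_frames, queue_size=4):
    """Overlap preprocessing and inference through a bounded queue."""
    buffer = queue.Queue(maxsize=queue_size)
    results = []
    sentinel = object()  # marks the end of the frame stream

    def producer():
        for frame_id in range(num_frames):
            buffer.put(preprocess(frame_id))  # blocks while the queue is full
        buffer.put(sentinel)

    def consumer():
        while True:
            item = buffer.get()
            if item is sentinel:
                break
            results.append(infer(item))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def ideal_fps(t_cpu_ms, t_gpu_ms, pipelined):
    """Steady-state throughput, ignoring queueing and thread overhead."""
    if pipelined:
        per_frame_ms = max(t_cpu_ms, t_gpu_ms)  # slower stage sets the pace
    else:
        per_frame_ms = t_cpu_ms + t_gpu_ms      # stages run back to back
    return 1000.0 / per_frame_ms


if __name__ == "__main__":
    print(len(run_pipeline(10)), "frames processed in order")
    # Hypothetical stage times, not measurements from this study.
    print(ideal_fps(10, 40, pipelined=False), "FPS sequential")
    print(ideal_fps(10, 40, pipelined=True), "FPS pipelined (ideal)")
```

In a real deployment the overlap only materializes when the inference call releases the Python interpreter lock, as GPU framework calls typically do, and synchronization between threads adds overhead; this is consistent with the measured gain being smaller than the ideal estimate.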

Discussion
In response to the three issues raised in the introduction, we compared the proposed MC-LCNN with the studies in Table 1. For issue 1, in terms of detection accuracy, MC-LCNN was slightly faster (9.15 ms) than the model of Liu et al. (10 ms) [12], while its detection accuracy (AP50) was 15.06 percentage points higher. In terms of inference speed, the detection accuracy of MC-LCNN (AP50 = 93.06%) was 3.06 percentage points higher than that reported by Yang et al. (AP50 = 90%) [11], while the inference time dropped from 0.7 s to 9.15 ms. To our knowledge, MC-LCNN is the first highly accurate real-time detector for medicinal chrysanthemums. For issue 2, it is clear from Table 1 that most studies were tested in ideal environments or under illumination variations only. In this study, the dataset was collected from natural environments, including complex unstructured conditions such as illumination variations, overlaps, and occlusions, thus significantly improving the robustness of the model. For issue 3, we tested MC-LCNN embedded in a low-power edge computing device, the NVIDIA Jetson TX2. In addition, we used a multithreaded CPU-GPU pipeline design to improve the inference speed of MC-LCNN.
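The per-image latency quoted above follows directly from the reported throughput. The short sketch below reproduces that conversion and the resulting speedup over the 0.7 s inference time of Yang et al.; the helper name `fps_to_latency_ms` is introduced here for illustration only, and all numbers are taken from the text.

```python
def fps_to_latency_ms(fps):
    """Convert throughput in frames per second to per-image latency in ms."""
    return 1000.0 / fps


# Figures reported in the text.
mc_lcnn_fps = 109.28   # MC-LCNN on the NVIDIA Tesla V100
yang_latency_ms = 700  # 0.7 s per image, Yang et al. [11]

mc_lcnn_ms = fps_to_latency_ms(mc_lcnn_fps)
speedup = yang_latency_ms / mc_lcnn_ms
ap50_gain = 93.06 - 90.0  # difference in percentage points

print(f"MC-LCNN latency: {mc_lcnn_ms:.2f} ms per image")
print(f"Speedup over 0.7 s per image: {speedup:.1f}x")
print(f"AP50 gain: {ap50_gain:.2f} percentage points")
```

The latency comes out at about 9.15 ms, which corresponds to a speedup of roughly 76 times, alongside the 3.06 percentage point gain in AP50.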
The proposed MC-LCNN has clear advantages but also shortcomings that need to be addressed. First, the inference speed of MC-LCNN was not the best among all the compared models, and inference speed is crucial for robotic picking. Moreover, when the proposed model was embedded in the Jetson TX2, it took around 0.6 s to test a single image, which is acceptable but unremarkable. Furthermore, actual unstructured environments involve more than illumination variations, overlaps, and occlusions, and we need to collect data from more varied scenarios to improve the robustness of the model.

Conclusions
In this work, we propose a new lightweight convolutional neural network, named MC-LCNN, for detecting medicinal chrysanthemums at the bud stage under complex unstructured environments (illumination variations, overlaps, and occlusions). We collected 4000 original images (1080 × 1920) as the dataset. In the NVIDIA Tesla V100 GPU environment, the AP50 on the test dataset reached 93.06%, and the inference speed was 109.28 FPS. The optimal data augmentation strategy for training MC-LCNN was the combination of Cutout and Blur. Furthermore, we compared the proposed MC-LCNN with 13 state-of-the-art object detection frameworks (54 models). MC-LCNN achieved the highest AP50 and was second only to PP-YOLOv2 in terms of inference speed. Finally, we embedded MC-LCNN into the NVIDIA Jetson TX2 for real-time object detection and improved the inference speed by 2 FPS through a multithreaded CPU-GPU pipeline design. In the future, the proposed MC-LCNN has the potential to be integrated, via the NVIDIA Jetson TX2, into a selective picking robot for the automatic harvesting of medicinal chrysanthemums.